Introduction

We were given a data set from the National Health and Nutrition Examination Survey (NHANES). According to CDC (2023), NHANES is a program of surveys, conducted since the 1960s, designed to assess the health and nutritional status of adults and children in the United States. It collects data through in-person interviews and physical examinations of participants.

The survey data were not a simple random sample, however. According to the CDC's National Health and Nutrition Examination Survey: Plan and Operations, 1999–2010 (Zipf et al. 2013), the sampling strategy consists of four stages:
1. Selection of counties as primary sampling units (PSUs).
2. Selection of segments within PSUs that constitute blocks of households.
3. Selection of specific households within segments.
4. Selection of individuals within a household.
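Because the design is not a simple random sample, population-level summaries are usually computed with the sampling weights rather than as plain averages. The sketch below shows a weighted mean under the assumption that the samplewt column of the data plays the role of the weight. The function name and the toy numbers are hypothetical, and the sketch ignores the strata and PSU information that design-based variance estimation would also need.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Toy numbers for illustration only (not taken from the survey).
body_weights_kg = [60.0, 80.0, 100.0]
sample_weights = [10000.0, 30000.0, 10000.0]
print(weighted_mean(body_weights_kg, sample_weights))
```

A participant with a large sampling weight represents more people in the population, so that participant's measurement counts for more in the estimate.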

We aim to study the relationship between body weight and the other health-related variables in the data.

Method

We began with an exploratory analysis of the variables using tables and charts. We then performed hypothesis tests on some of the variables. Lastly, we fit a linear regression model with “weight” as the response variable and the other variables, including confounders, as predictors.

Part 1: Exploratory Analysis

We began our analysis by compiling a data dictionary, shown in Table 1 below. Some variables have a relatively high percentage of missing values (up to 9.7% for marstat). In Part 2 we used hypothesis tests to decide whether some of these variables could be excluded from the regression analysis in Part 3.

Weight is a continuous variable in our data. A simple way to categorize it is through the body mass index (BMI). The data already contain an obese variable, which applies a threshold of 35 to the BMI value: a person is classified as obese if the BMI exceeds 35, and as not obese otherwise. We therefore used the obese variable as the categorical counterpart of weight in our project.
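As a minimal sketch of how such a flag can be derived, the code below assumes the standard definition BMI = weight (kg) / height (m)^2 and the BMI > 35 rule stated in the data dictionary. The function names are hypothetical. The check uses the first example values listed for wt, ht and bmi in the data dictionary, assuming they belong to the same participant.

```python
def bmi(weight_kg, height_cm):
    """Body mass index in kg/m^2 from weight in kg and standing height in cm."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


def is_obese(bmi_value, threshold=35.0):
    """Mimics the data set's `obese` flag: True when BMI is strictly above 35."""
    return bmi_value > threshold


value = bmi(87.4, 164.7)
print(round(value, 2), is_obese(value))
```

With these inputs the computed BMI rounds to 32.22, the value listed for bmi, and falls below the threshold of 35.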

Table 1: Data Variable Definition

| Variable | Type | Example | Unique values | Missing (%) | Comment |
|---|---|---|---|---|---|
| id | integer | 1, 2, 3 | 6482 | 0% | Identification Code (1 - 6482) |
| gender | factor | Male, Female | 2 | 0% | Gender (1: Male, 2: Female) |
| age | integer | 34, 16, 60 | 65 | 0% | Age (Years) |
| marstat | factor | Married, NA, Widowed | 6 | 9.7% | Marital Status (1: Married, 2: Widowed, 3: Divorced, 4: Separated, 5: Never Married, 6: Living Together) |
| samplewt | numeric | 80100.544, 13953.078, 20090.339 | 2499 | 0% | Statistical Weight (4084.478 - 153810.3) |
| psu | integer | 1, 2 | 2 | 0% | Pseudo-PSU (1, 2) |
| strata | integer | 9, 10, 1 | 15 | 0% | Pseudo-Stratum (1 - 15) |
| tchol | integer | 135, 192, 202 | 251 | 6.09% | Total Cholesterol (mg/dL) |
| hdl | integer | 50, 60, 45 | 112 | 6.09% | HDL-Cholesterol (mg/dL) |
| sysbp | integer | 114, 112, 154 | 61 | 8.53% | Systolic Blood Pressure (mm Hg) |
| dbp | integer | 88, 62, 70 | 40 | 9.16% | Diastolic Blood Pressure (mm Hg) |
| wt | numeric | 87.400002, 72.300003, 116.8 | 957 | 0.57% | Weight (kg) |
| ht | numeric | 164.7, 181.3, 166 | 527 | 0.57% | Standing Height (cm) |
| bmi | numeric | 32.22, 22, 42.39 | 2276 | 0.57% | Body Mass Index (kg/m^2) |
| vigwrk | factor | No, Yes, NA | 2 | 0.02% | Vigorous Work Activity (1: Yes, 2: No) |
| modwrk | factor | No, Yes, NA | 2 | 0.02% | Moderate Work Activity (1: Yes, 2: No) |
| wlkbik | factor | No, Yes, NA | 2 | 0.02% | Walk or Bicycle (1: Yes, 2: No) |
| vigrecexr | factor | No, Yes, NA | 2 | 0.02% | Vigorous Recreational Activities (1: Yes, 2: No) |
| modrecexr | factor | No, Yes, NA | 2 | 0.03% | Moderate Recreational Activities (1: Yes, 2: No) |
| sedmin | integer | 480, 240, 720 | 37 | 1.22% | Minutes of Sedentary Activity per Week (0 - 840) |
| obese | factor | No, Yes, NA | 2 | 0.57% | BMI>35 (1: No, 2: Yes) |

According to the CDC's classification of body weight, a BMI below 18.5 is Underweight, a BMI from 18.5 to 24.9 is Healthy Weight, a BMI from 25 to 29.9 is Overweight, and a BMI of 30 or higher is Obesity. We adopted these categories and found a slight positive relationship between body-weight category and total cholesterol level, but a negative relationship between body-weight category and HDL. Because total cholesterol consists largely of HDL and LDL, this pattern suggests that the obese population tends to have a higher level of LDL and a lower level of HDL.
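The CDC cut-offs quoted above can be written as a small lookup. This is a sketch only: the function name is hypothetical, and it assumes the boundaries are applied as "below 18.5", "below 25", "below 30", and "30 or higher" so that values such as 24.95 are not left uncategorized.

```python
def cdc_bmi_category(bmi_value):
    """Map a BMI value (kg/m^2) to the CDC adult weight-status category."""
    if bmi_value < 18.5:
        return "Underweight"
    if bmi_value < 25.0:
        return "Healthy Weight"
    if bmi_value < 30.0:
        return "Overweight"
    return "Obesity"


# Example BMI values listed in the data dictionary.
for value in (32.22, 22.0, 42.39):
    print(value, cdc_bmi_category(value))
```

Note that a BMI of 32.22 falls in the CDC Obesity category, while the data set's obese variable (BMI > 35) would label the same person as not obese, so the two classifications are not interchangeable.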

According to ATPIII (n.d.), we can also categorize the cholesterol level.

Part 2: Hypothesis Tests

We first test the independence between obesity and marital status. We form the following contingency table:
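Contingency Table of Marital Status by Obesity

| Marital Status | Obesity: No | Obesity: Yes |
|---|---|---|
| Married | 2530 | 474 |
| Widowed | 418 | 86 |
| Divorced | 528 | 112 |
| Separated | 158 | 35 |
| Never Married | 863 | 160 |
| Living Together | 388 | 66 |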

Let X be the categorical random variable for Marital Status and Y the one for Obesity, and assume a random sample of n trials. Define the count random variable \(N_{ij}:=\sum_{k=1}^n \mathbf{I}_k(X=i, Y=j)\), where \(\mathbf{I}_k\) is the indicator function for the k-th trial. Then the random vector \([N_{11}, ..., N_{IJ}]\) has a multinomial distribution with n trials and cell probabilities \(\vec{p}=[p_{11}, ..., p_{IJ}]\). Writing \(p_{i+}=\sum_j p_{ij}\) and \(p_{+j}=\sum_i p_{ij}\) for the marginal probabilities, our hypothesis test is therefore:

\[\begin{gather*} H_0: p_{ij}= p_{i+} \cdot p_{+j} ~ \forall i,j\\ H_1:p_{ij} \neq p_{i+} \cdot p_{+j} ~ \text{for some } i,j \end{gather*}\]
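Assuming the test in question is Pearson's chi-squared test of independence (the text does not name the variant), the statistic compares each observed count with the count expected under \(H_0\), estimated from the row totals \(N_{i+}\) and column totals \(N_{+j}\):

\[\begin{gather*} X^2=\sum_{i=1}^{I}\sum_{j=1}^{J}\frac{(N_{ij}-\hat{E}_{ij})^2}{\hat{E}_{ij}}, \qquad \hat{E}_{ij}=\frac{N_{i+}\,N_{+j}}{n} \end{gather*}\]

Under \(H_0\), \(X^2\) is approximately chi-squared distributed with \((I-1)(J-1)\) degrees of freedom. With six marital-status levels and two obesity levels, that gives \(5\) degrees of freedom.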

The chi-squared test gives a p-value of 0.6894, so there is not enough evidence to reject the null hypothesis. In other words, we cannot conclude that there is a relationship between obesity and marital status.
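The test can be sketched from the contingency table using only the Python standard library. This assumes an unweighted Pearson chi-squared test of independence; the function names are hypothetical, and the tail probability is computed from a series expansion of the incomplete gamma function, which is adequate for moderate p-values like this one. In practice a library routine such as SciPy's `scipy.stats.chi2_contingency` would normally be used instead.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-squared variable with df
    degrees of freedom, via the series for the regularized lower incomplete
    gamma function P(a, z) with a = df / 2 and z = x / 2."""
    a, z = df / 2.0, x / 2.0
    if z <= 0:
        return 1.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return 1.0 - lower


def chi2_independence(table):
    """Pearson chi-squared test of independence for a table of counts.
    Returns (statistic, degrees of freedom, p-value)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for row, row_total in zip(table, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, chi2_sf(stat, df)


# Rows: Married, Widowed, Divorced, Separated, Never Married, Living Together.
# Columns: Obesity No, Obesity Yes.
marital_by_obesity = [
    [2530, 474],
    [418, 86],
    [528, 112],
    [158, 35],
    [863, 160],
    [388, 66],
]

stat, df, p_value = chi2_independence(marital_by_obesity)
print(f"X^2 = {stat:.3f}, df = {df}, p-value = {p_value:.4f}")
```

Run on the counts above, this gives 5 degrees of freedom and a p-value close to the reported 0.6894.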

We ran the same test for the physical activity variables against obesity. From Table 2 we can reject independence between obesity and each of the wlkbik, vigrecexr and modrecexr variables, but not between obesity and vigwrk or modwrk.

Table 2: p-values of Independence Tests between Different Variables and Obesity

| | vigwrk | modwrk | wlkbik | vigrecexr | modrecexr |
|---|---|---|---|---|---|
| p-value | 0.5695 | 0.3037 | 1.064e-07 | 4.061e-15 | 2.573e-09 |
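The decision rule behind these conclusions can be sketched as follows. The significance level of 0.05 is an assumption, since the text does not state one, and the Bonferroni adjustment for five tests is shown only as an optional sensitivity check; the names are hypothetical.

```python
# p-values copied from the table of independence tests against obesity.
P_VALUES = {
    "vigwrk": 0.5695,
    "modwrk": 0.3037,
    "wlkbik": 1.064e-07,
    "vigrecexr": 4.061e-15,
    "modrecexr": 2.573e-09,
}

ALPHA = 0.05  # assumed significance level, not stated in the text


def rejected(p_values, alpha):
    """Names of variables whose independence from obesity is rejected."""
    return [name for name, p in p_values.items() if p < alpha]


print(rejected(P_VALUES, ALPHA))
# Bonferroni: divide alpha by the number of tests performed.
print(rejected(P_VALUES, ALPHA / len(P_VALUES)))
```

Both thresholds select the same three variables, so the conclusion does not depend on whether a multiple-testing correction is applied.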

Conclusion

References

ATPIII. n.d.
CDC. 2023. https://www.cdc.gov/nchs/nhanes/about_nhanes.htm.
Zipf G, Chiappa M, Porter KS, et al. 2013. “National Health and Nutrition Examination Survey: Plan and Operations, 1999–2010.” National Center for Health Statistics. Vital and Health Statistics 1 (56).